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Abstract 

We consider learning on graphs, guided by kernels that encode similarity between vertices. Our 
focus is on random walk kernels, the analogues of squared exponential kernels in Euclidean spaces. 
We show that on large, locally treelike, graphs these have some counter-intuitive properties, specifi- 
cally in the limit of large kernel lengthscales. We consider using these kernels as covariance matrices 
of e.g. Gaussian processes (GPs). In this situation one typically scales the prior globally to nor- 
malise the average of the prior variance across vertices. We demonstrate that, in contrast to the 
Euclidean case, this generically leads to significant variation in the prior variance across vertices, 
which is undesirable from the probabilistic modelling point of view. We suggest the random walk 
kernel should be normalised locally, so that each vertex has the same prior variance, and analyse 
the consequences of this by studying learning curves for Gaussian process regression. Numerical 
calculations as well as novel theoretical predictions for the learning curves using belief propagation 
make it clear that one obtains distinctly different probabilistic models depending on the choice of 
normalisation. Our method for predicting the learning curves using belief propagation is signifi- 
cantly more accurate than previous approximations and should become exact in the limit of large 
random graphs. 



1. Introduction 



Gaussian processes (GPs) have become a workhorse for probabilistic inference th at has been de- 



veloped in a wide range of research f i elds under various guises (see for example iKleiinenl . l2009t 
Handcock and Steinl . Il993t iNeall . Il996t IMeinhold and Singpurwallal Il983h . Their success and wide 
adoption can be attributed mainly to their intuitive nature and ease of use. They owe their in- 
tuitiveness to being one of a large family of kernel methods that implicitly map lower dimensional 
spaces with non-linear relationships to higher dimensional spaces where (hopefully) relationships are 
linear. This feat is achieved by using a kernel, which also encodes the types of functions that the 
GP prefers a priori. The ease of use of GPs is due to the simplicity of implementation, at least 
in the basic setting, where prior and posterior distributions are both Gaussian and can be written 
explicitly. 

An important question for any machine learning method is how 'quickly' the method can gener- 
alise its prediction of a rule to the entire domain of the rule (i.e. how many examples are required to 
achieve a particular generalisation error). This is encapsulated in the learning curve, which traces 
average error versus number of examples. Learning curves h ave been studied for a variety of infer- 
ence methods and are well understood for parametric models dSeung et all 1992 1 Amari et all 1992 ; 
Watkin et al. . 1993 ; Qpper and Haussler . 19951 Haussler et al. . 1996 : Freeman and SaadTl997 ) but 
rather less is known for non-parametric models such as GPs. In the case of GP re gression , resear ch 
has predominantly focused on leaning curves for input data from Euclidean spaces |Sollichl . [l999allbl : 
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Qpper and Vivarellil Il999t IWilliams and Vivarelil l2000l ; iMalzahn and Qpperl . l2003t ISollichl 12002 ; 
Sollich and Haleed. 120021 ISollich and Williams! . bOOaT but there are many domains for which the 



input data has a discrete structure. One of the simplest cases is the one where inputs are vertices 
on a graph, with connections on the graph encoding similarity relations between different inputs. 
Examples could include the internet, social networks, protein networks and financial markets. Such 
discrete input spaces with graph structure are becoming more important, and therefore so is an 
understanding of GPs, and machine learning tec hniques in general, o n these spaces. 

In this paper we expand on earlier work in ISollich et al.l (j2009h : lllrrv and Sollich! (|2010l ) and 
focus on predicting the learning curves of GPs used for regression ( where outputs are from the 
whole real line) on large sparse graphs, using the random walk kernel (jKondor and LaffertvL l20"o"3 



Smola and Kondorl . 2003). 



The rest of this paper will be structured as follows. In Section [2] we begin by analysing the 
random walk kernel as a function of its lengthscale and study its approach to the fully correlated 
limit. With a better understanding of the random walk kernel in hand, we proceed in Section[3]to an 
analysis of the use of the random walk kernel for GP regression on graphs. We begin in Section [3~2l 
by looking at how kernel normalisation affects the prior probability over functions. We show that the 
more frequently used global normalisation of a kernel by its average prior variance is inappropriate 
for the highly location dependent random walk kernel, and suggest normalisation to uniform local 
prior variance as a remedy. To understand how this affects GP regression using random walk kernels 
quantitatively, we extend f irst in S ection I3.4 | an existing appro x imati on to the learning curve in 
terms of kernel eigenvalues ISollichl (Il999al ): IOppct and Malzahnl (|2002f) to the discrete input case, 
allowing for arbitrary normalisation. This approximation turns out to be accurate only in the initial 
and asymptotic regimes of the learning curve. 

The core of our analysis begins in Section |4] with the development of an improved learning 
curve approximation based on belief propagation. We first apply this, in Section FTTl to the case 
of globally normalised kernels as originally proposed. The belief propagation analysis for global 
normalisation also acts as a useful warm-up for the extension to the prediction of learning curves for 
the locally normalised kernel setting, which we present in Section [4.31 In both sections we compare 
our predictions to numerical simulations, finding good agreement that improves significantly on the 
eigenvalue approximation. Finally, to emphasise the distinction between the use of globally and 
locally normalised kernels in GP regression, we study qualitatively the case of model mismatch, 
with a GP with a globally normalised kernel as the teacher and a GP with a locally normalised 
kernel as the student. The resulting learning curves show that the priors arising from the two 
different normalisations are fundamentally different; the learning curve can become non-monotonic 
and develop a maximum as a result of the mismatch. We conclude in Section [6] by summarising our 
results and discussing further potentially fruitful avenues of research. 



2. The random walk kernel 



A wide range of machine learning techniques like Gaussian processes capture prior correlations 
between points in an input space by mapping to a higher dimen sional space, where corre l ations 
can be represented by a linear combination of 'f e ature s' (see e.g. Rasmussen and Williams! . 20051 : 
Muller et al. . 2001 ; Cristianini and Shawe-Tavlor . 2000h . Direct calculation of correlations in this 
high dimensional space is avoided using the 'kernel trick', where the kernel function implicitly cal- 
culates inner products in feature space. The widespread use of, and therefore extensive research 
in, kernel b ased ma c hine l earning has resulted in kernels being developed for a wide range of input 
spaces (see iGentonl. |2002|. and refe r ences therein). We focus in this paper on the class of kernels 
introduced in iKondor and Laffertvl (|2002i ) . These make use of the normalised graph Laplacian to 
define correlations between vertices of a graph. 

We denote a generic graph by G(V, £ ) with a vertex set V = {1, . . . , V} and edge set £ . We encode 
the connection structure of G using an adjacency matrix A, where Aij = 1 if vertex i is connected to 
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j, and otherwise; we exclude self-loops so that An = 0. We denote the number of edges connected 
to vertex i, known as the degree, by di = Ay and the degree ma trix J? as a diagonal matrix 
of the vertex degrees, i.e. = diSij. The class of kernels created in Kondor and Laffertvl j2002 ) 



are constructed using the normalised Laplacian, L = I — D^ 1 / 2 AD^ 1 ! 2 fsee IChuna . 1 1996T) as a 
replacement for the Laplacian in continuous spaces. Of particular interest is the diffusion kernel 
and its easier to calculate approximation, the random walk kernel. Both of these kernels can be 
viewed as an approximation to the ubiquitous squared exponential kernel that is used in continuous 
spaces. The direct graph equ ivalent of the squared exponential kernel is given by the diffusion kernel 



(jKondor and Laffertvl . 120021 ). It is defined as 



C = cxp , ff >0, (1) 

where a sets the length-scale of the kernel. Unli ke in continuous spaces, the exponential in the 



diffusion kernel is costly to calculate. To avoid this. lSmola and Kondorl (2003) proposed as a cheaper 
approximation the random walk kernel 

C={l-a- 1 L) P =({\-a- 1 )I + a- 1 D- 1 l 2 AD- 1 l 2 Y , a > 2, peN. (2) 



This gives back the diffusion kernel in the limit a,p — > oo whilst keeping p/a = a 2 /2 fixed. The 
random walk kernel derives its name from its use of random walks to express correlations between 
vertices. Explicitly, a binomial expansion of equation ([5]) gives 

frp \{^ _ n- l \v-i(„-i\<i(n-y 2 4n-V2w 



c = ( „ ) (i - a-y-^a-yiD-^AD-y 

(3) 



p 



= D- 1 ' 2 £ T ) (1 - a-y-^a-yiAD-yD 1 ' 2 

q=Q 



The matrix AD 1 is a random walk transition matrix: (AD~ 1 )ij is the probability of being at vertex 
i after one random walk step starting from vertex j. Apart from the pre- and post-multiplication by 
D~ x / 2 and D 1 ^ 2 , the kernel C is therefore a g-step random walk kernel, averaged over the number of 
steps q distributed as q ~ Binomial(p, a -1 ). An equivalent interpretation is in terms of a p-step lazy 
random walk, where at each step the walker stays at the current vertex with probability (1 — a -1 ) 
and moves to a neighbouring vertex with probability a -1 . 

Using either interpretation, one sees that p/a is the lengthscale over which the random walk 
can diffuse along the graph, and hence the lengthscale describing the typical maximum range of 
the random walk kernel. In the limit of large p, where this lengthscale diverges, the kernel should 
represent full correlation across all vertices. One can see that this is the case by observing that 
for large p, a random walk on a graph will approach its stationary distribution, cx De, e = 
(1,...,1) . The g-step transition matrix for large q is therefore (AD~ 1 ) q sa pooe T = Dee T , 
representing the fact that the random walk becomes stationary independently of the starting vertex. 
This gives, for p — > oo, the kernel C cx D 1 l 2 ee T D 1 / 2 , i.e. Cy oc dl' 2 dj. This corresponds to 
full correlation across vertices as expected; explicitly, if / is a Gaussian process on the graph with 
covariance matrix D 1 l 2 ee T D 1 ^ 2 , then / = vD 1 / 2 e with v a single Gaussian degree of freedom. 

We next consider how random walk kernels on graphs approach the fully correlated case, and 
show that even for 'simple' graphs the convergence to this limit is non-trivial. Before we do so, we 
note an additional peculiarity of random walk kernels compared to their Euclidean counterparts: in 
addition to the maximum range lengthscale p/a discussed so far, they have a diffusive lengthscale 
(7 = {2p/a) 1 / 2 , which is suggested for large p and a by the lengthscale of the corresponding diffusion 
kernel |T]). This diffusive lengthscale will appear in our analysis of learning curves in the large p-limit 
Section EXU 
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2.1 The d-regular tree: A concrete example 

To begin our discussion of the dependence of the random walk kernel on the lengthscale p/a, we first 
look at how this kernel behaves on a ci-regular graph sampled uniformly from the set of all c£-regular 
graphs. Here d-regular means that all vertices have degree di — d. For a large enough number of 
vertices V, typical cycles in such a d-regular graph are also large, of length 0(\ogV), and can be 
neglected for calculation of the kernel when V — > oo. We therefore begin by assuming the graph is 
an infinite tree, and assess how the cycles that do exist on random e?-regular graphs cause departures 
from this picture. 

A e?-regular tree is a graph where each vertex has degree d with no cycles; it is unique up to 
permutations of the vertices. Since all vertices on the tree are equivalent, the random walk kernel 
Cij can only depend on the distance between vertices i and j, i.e. the smallest number of steps on 
the graph required to get from one vertex to the other. Denoting the value of a p-step lazy random 
walk kernel for vertices a distance I apart by Ci iP , we can determine these values by recursion over 
p as follows: 

Ci iP= o = Sifi, 7 P +iCo,p+i = ( 1 j Co,p H — Ci iP 

x a (4) 

1 „ L l\ „ d-l 



7p+iC(,p+i — ~adF l ~ 1,p \ a) ^ L ' p ^ ad~^ l+1 ' p ' ~ ^' 

Here j p is chosen to achieve the desired normalisation of the prior variance for every p. We will 
normalise so that Co lP = 1. 

Figure [T] (left) shows the results obtained by iterating equation Q numerically for a 3-regular 
tree with a — 2. As expected the kernel becomes longer-ranged initially as p is increased, but seems 
to approach a non-trivial limiting form. This can be calculated as (see Appendix |A")) 



C, 



ltd -2) 
1+ V . 



(5) 



(d-l)'/ 2 ' 



Equation ([5]) can be derived by ta king the a 2 — > oo limit of the integral expression for the diffusion 
kernel from ung and Yaul dl999h whilst preserving normalisation of the kernel (see Appendix IA.1I 



for further details). Alternatively the result ([5]) can be obtained by rewriting the random walk 
in terms of shells, i.e. grouping vertices according to distance I from a chosen central vertex. The 
number of vertices in the l-th shell, or shell volume, is vi = d(d— 1) for I > 1 and vq = 1. Introducing 
Ri,p = C^py/vl, Equation (j4]) can be written in the form 



Ri.p=a = Si.o, "tp+\Ro, P +i — [ 1 — — j Ro, P + —Rl,p 



7p+i-R;,p+i — , Ri-i,p + — — \ Rip + — Ri+i,p I > 1- 



(6) 



This is just the un-normalised diffusion equation for a biased ra ndom walk on a one dimen sional 
lattice with a reflective boundary at 0. This has been solved in iMonthus and Texier (|l996lh and 



mapping this solution back to C/ iP gives §5§ (see Appendix I A. 2 1 for further details). 

To summarise thus far, the analysis on a d-regular tree shows that, for large p, the random walk 
kernel does not approach the expected fully correlated limit: because all vertices have the same 
degree this limit would correspond to C^p-^oa = 1. On the other hand, on a d-regular graph with 
any finite number V of vertices, the fully correlated limit must necessarily be approached as p — > oo. 
As a large regular graph is locally treelike, the difference must arise from the existence of long cycles 
in a regular graph. 

To estimate when the existence of cycles will start to affect the kernel, consider first a c?-regular 
tree truncated at depth I. This will have V = 1 + d(d — l) 1 ^ 1 vertices. On a d-regular graph with 



4 



Gaussian processes on random graphs 




Figure 1: (Left) Random walk kernel C/, p on a 3- regular tree plotted against distance I for increasing 
number of steps p and a — 2. (Right) Comparison between numerical results for the 
average nearest neighbour kernel K± >p on random 3-regular graphs with the result C\ tP on 
a 3-regular tree, calculated numerically by iteration of @. 



the same number of vertices, we therefore expect to encounter cycles after a number of steps, taken 
along the graph, of order I, In the random walk kernel the typical number of steps is p/a, so effects 
of cycles should appear once p/a becomes larger than 

p ^ log(V) 

a ~ log(<2- 1)' 1 1 

Figure [1] (right) shows a comparison between Ci jP as calculated from equation @ for a 3-regular 
tree and its analogue on random 3-regular graphs of finite size, which we call K\ p . We define this 
analogue as the average of Cy/ wCuCjj over all pairs of neighbouring vertices on a fixed graph, 
averaged further over a number of randomly generated regular graphs. The square root accounts for 
the fact that local kernel values Cu can vary on a regular graph because of cycles, while they are the 
same for all vertices of a regular tree. Looking at Figure [T] (right) one sees that, as expected from 
the arguments above, the nearest neighbour kernel value for the 3-regular graph, -Ki lP , coincides 
with its analogue C\ tP on the 3-regular tree for small p. When p/a crosses the threshold (O, cycles 
in the regular graph become important and the two curves separate. For larger p, the kernel value 
for neighbouring vertices approaches the fully correlated limit K\^ v — ¥ 1 on a regular graph, while 
on a regular tree one has the non-trivial limit Ci lP — ¥ 2y ' d —1/d from ([5|). 

In conclusion of our analysis of random walk kernels, we have seen that these kernels have 
an unusual dependence on their lengthscale p/a. In particular, kernel values for vertices a short 
distance apart can remain significantly below the fully correlated limit even if p/a is large. That 
limit is approached only once p/a becomes larger than the graph size-dependent threshold ([7]), at 
which point cycles become important. We have focussed here on random regular graphs, but the 
same qualitative behaviour should be observed also on graphs with a non-trivial distribution of 
vertex degrees di. 
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3. Learning curves for Gaussian process regression 

Having reached a better understanding of the random walk kernel we now study its application 
in machine learning. In particular we focus on the use of the random walk kernel for regression 
with Gaussian processes. We will begin, for completeness, with an introduction to GPs for regres- 
sion. For a more compr e hensiv e discussion of GPs for machine learning we direct the reader to 
Rasmussen and Williams! ( 2005 ). 



3.1 Gaussian process regression: kernels as covariance functions 

Gaussian process regression is a Bayesian inference technique that constructs a posterior distribution 
over a function space, P(f\x, y), given training input locations x = [x\, . . . , xn) t and corresponding 
function value outputs y = (yi, . . . , yjv) T . The posterior is constructed from a prior distribution 
P(f) over the function space and the likelihood P(y\f,x) to generate the observed output values 
from function / by using Bayes' theorem 

P(f\xy) = (8) 

nmv) fdf>p( v \f>,x)p(f>y [H) 

In the GP setting the prior is chosen as a Gaussian process, where any finite number of function 
values has a joint Gaussian distribution, with a covariance matrix with entries given by a covariance 
function or kernel C(x, x') and with a mean vector with entries given by a mean function /i(x). For 
simplicity we will focus on zero mean GPfl and a Gaussian likelihood, which amounts to assuming 
that training outputs are corrupted by independent and identically distributed Gaussian noise. 
Under these assumptions all distributions are Gaussian and can be calculated explicitly. If we 
assume we are given training data {(a; M , yp)\fJt = 1, . . . , N} where is value of the target or 'teacher' 
function at input location x^, corrupted by additive Gaussian noise with variance a 2 . The posterior 
distribution is then given by another Gaussian process with mean and covariance functions 

f(x) = k{xfK- l y (9) 
Cov(x, x') = C(x, x') - k{x) T K- l k{x') (10) 

where k(x*) = (C(xi,x*), . . . ,C(xn,x*)) t and (K),^ = C{x v ,x,j) + S L ,^a 2 . With the posterior 
in the form of a Gaussian process, predictions are simple. Assuming a squared loss function, the 
optimal prediction of the outputs is given by f(x) and a measure of uncertainty in the prediction is 
provided by Cov(x, x) 1 / 2 . 

Equations © and (fTU|) illustrate that, in the setting of GP regression, kernels are used to change 
the type of function preferred by the Gaussian process prior, and correspondingly the posterior. The 
kernel can encode prior beliefs about smoothness properties, lcngthscale and expected amplitude of 
the function we are trying to predict. Of particular importance for the discussion below, C(x,x) 
gives the prior variance of the function / at input x, so that C(x,x) 1 / 2 sets the typical function 
amplitude or scale. 

3.2 Kernel normalisation 

Conventionally one fixes the desired scale of the kernel using a global normalisation: denoting the 
unnormalised kernel by C{x,x') one scales C(x,x') — C(x,x')/k to achieve a desired average of 
C(x, x) across input locations x. In Euclidean spaces one typically uses translationally invariant 
kernels like the squared exponential kernel. For these, C(x,x) is the same for all input locations x 
and so global normalisation is sufficient to fix a spatially uniform scale for the prior amplitude. In 
the case of kernels on graphs, on the other hand, the local connectivity structure around each vertex 



1. In the discussion and analysis that follows, generalisation to non-zero mean GPs is straightforward. 
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can be different. Since information about correlations 'propagates' only along graph edges, graph 
kernels are not generally translation invariant. In particular, there can be large variation among the 
prior variances at different vertices. This is usually undesirable in a probabilistic model, unless one 
has strong prior knowledge to justify such variation. For the random walk kernel, the local prior 
variances are the diagonal entries of equation <j3j) . These are directly related to the probability of 
return of a lazy random walk on a graph, which depends sensitively on the local graph structure. 
This dependence is in general non-trivial, and not just expressible through e.g. the degree of the 
local vertex. It seems difficult to imagine a scenario where such a link between prior variances and 
local graph structures could be justified by prior knowledge. 

To emphasise the issue, Figure [2] shows examples of distributions of localprior variances Cu for 
random walk kernels globally normalised to an average prior variance of unitylj The distributions are 
peaked around the desired value of unity but contain many 'outliers' from vertices with abnormally 
low or high prior variance. Figur e [2] (left) shows the distr ibution of Cu for a large single instance 



of a Erdos-Renyi random graph (jErdos and Renvil . Il959h . In such graphs, each edge is present 



independently of the others with some fixed probability, giving a Poisson distribution of degrees 
p\(d) = X d exp(— X)/dl; for the figure we chose average degree A = 3. Figure [2] (right) shows 



analogous results for a generalised random graph with power law mixing distribution (Br itton et al 
2006). Generalised random graphs are an extension of Erdos-Renyi random graphs where different 
edges are assig n ed diff erent probabilities of being present. By appropriate choice of these probabilities 



(Britton et al 



l2006h . one can generate a degree distribution that is a superposition of Poisson 
distributions, p(d) = f dXp\(d)p(X). We have taken a shifted Pareto distribution, p(X) = aA^/A Q+1 
with exponent a — 2.5 and lower cutoff X m = 2 for the distribution of the means. 

Looking first at Figure [5] (left), we know that large Poisson graphs are locally tree-like and hence 
one might expect that this would lead to relatively uniform local prior variances. As shown in the 
figure, however, even for such tree-like graphs large variations can exist in the local prior variances. 
To give some specific examples, the large spike near is caused by single disconnected vertices and 
the smaller spike at around 6.8 arises from two- vertex (single edge) disconnected subgraphs. Single 
vertex subgraphs have an atypically small prior variance since, for a single disconnected vertex i, 
before normalization Cu = (1 — a _1 ) p which is the q — contribution from equation ([3]). Other 
vertices in the graph will get additional contributions from q > 1 and so have a larger prior variance. 
This effect will become more pronounced as p is increased and the binomial weights assign less weight 
to the q = term. 

Somewhat surprisingly at first sight, the opposite effect is seen for two- vertex disconnected sub- 
graphs as shown by the spike around Cu = 6.8 in Figure [2] (left). For vertices on such subgraphs, 

Cji — Y^q=o^ {2q) a ~ 2q 0- ~ a^ 1 ) p ^ 2q , which is an atypically large return probability: after any even 
number of steps, the walker must always return to its starting vertex. A similar situation would 
occur on vertices at the centre of a star. This illustrates that local properties of a vertex alone, like 
its degree, do not constrain the prior variance. In a two-vertex disconnected subgraph both vertices 
have degree 1. But there will generically be other vertices of degree 1 that are dangling ends of a 
large connected graph component, and these will not have similarly elevated return probabilities. 
Thus, local graph structure is intertwined in a complex manner with local prior variance. 

The black line in Figure [2] (left) shows theoretical predictions (see Section 14. 2p for the prior 
variance distribution in the large graph limit. There is significant fine structure in the various 
peaks, on which theory and simulations agree well where the latter give reliable statistics. The 
decay from the mean is roughly exponential (see linear- log plot in inset), emphasizing that the 
distribution of local prior variances is not only rather broad but can also have large tails. 

For the power law random graph, Figure [2] (right), the broad features of the distribution of local 
prior variances Cu are similar: a peak at the desired value of unity, overlaid by spikes which again 
come from single and two- vertex disconnected subgraphs. The inset shows that decay from the mean 



2. We use Cu again here, instead of C(i,i) as in our general discussion of GPs; the subscript notation is more 
intuitive because the covariance function on a graph is just a V X V matrix. 
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Figure 2: (Left) Grey: histogram of prior variances for the globally normalised random walk kernel 
with a = 2, p — 10 on a single instance of a Poisson (Erdos-Renyi) graph with mean 
degree A = 3 and V = 10000 vertices. Black: prediction for this distribution in the large 
graph limit (see Section |4~2]) . Inset: Linear-log plot of the tail of the distribution. (Right) 
As (left) but for a power law generalised random graph with exponent 2.5 and cutoff 2. 



is roughly exponential again, but with a slower decay; this is to be expected since power law graphs 
exhibit many more different local structures with a significantly larger probability than is the case 
for Poisson graphs. Accordingly, the distribution of the Cu also has a larger standard deviation than 
for the Poisson case. The maximum values of Cu that we see in these two specific graph instances 
follow the same trend, with max^ Cu « 40 for the power law graph and max^ Cu ~ 15 for the Poisson 
graph. Such large values would constitute rather unrealistic prior assumptions about the scaling of 
the target function at these vertices. 

To summarise, Figure [2] shows that after global normalisation a random walk kernel can retain 
a large spread in the local prior variances, with the latter depending on the graph structure in a 
complicated manner. We propose that to overcome this one should use a local normalisation. For 
a desired prior variance c this means normalising according to CV, = cCij /(KiKj) 1 ^ 2 with local nor- 
malisation constants Ki = Cu] here Cij is the unnormalized kernel matrix as before. This guarantees 
that all vertices have exactly equal prior variance as in the Euclidean case. No uncontrolled local 
variation in the scaling of the function prior then remains, and the computational overhead of local 
over global normalisation is negligible. The effect of this normalisation on the behaviour of GP 
regression is a key question for the remainder of this paper; numerical simulation results are shown 
in Section 1331 below, while our theoretical analysis is described in Section [4] 

3.3 Predicting the learning curve 

The performance of non-parametric methods such as GPs can be characterised by studying the 
learning curve 

^) = ((((^X>-(/*W 2 ) ) ) ) (ii) 

\ \ \ \ (=1 / y\g,xl gl xl Q 
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defined as the average squared error between the student and teacher's predictions / = (/(l), . . . , /(V)) T 
and g = (g(l), ■ . ■ ,g(V)) T respectively, averaged over the student's posterior distribution given the 
data f\x,y, the outputs given the teacher y\g, x, the teacher functions g, and the input locations 
x. This gives the average generalisation error as a function of the number of training examples. For 
simplicity we will assume that the input distribution is uniform across the vertices of the graph. 

Because we are analysing GP regression on graphs, after the averages discussed so far the gen- 
eralisation error will still depend on the structure of the specific graph considered. We therefore 
include an additional average, over all graphs in a random graph ensemble Q . We consider graph 
ensembles here defined by the distribution of degrees df. we specify a degree sequence {d\, . . . , dy}, 
or, for large V, equivalently a degree distribution p{d), and pick at random any one of the graphs 
that has this degree distribution. The actual shape of the degree distribution is left arbitrary, as 
long as it has finite mean. Our analysis therefore has broad applicability, including in particular the 
graph types already mentioned above (d- regular graphs, where p(d') = 8441 , Poisson graphs, power 
law generalised random graphs). 

For this paper, as is typical for learning curve studies, we will assume that teacher and student 
have the same prior distribution over functions, and likewise that the assumed Gaussian noise of 
variance a 2 reflects the actual noise process corrupting the training data. This is known as the 
matched cas^. Under this assumption the generalisation error becomes the Bayes error, which given 
that we are considering squared error simplifies to the posterior variance of the student averaged over 
data sets and graphs (jRasmussen and Williams! . 120051 ) . Since we only need the posterior variance we 
shift / so that the posterior mean is 0; /j is then just the deviation of the function value at vertex 
i from the posterior mean. The Bayes error can now be written as 



e(N) = 




(12) 



Note that by shifting the posterior distribution to zero mean, we have eliminated the dependence 
on y in the above equation. That this should be so can also be seen from (ITU1) for the posterior 
(co-)variance, which only depends on training inputs x but not the corresponding outputs y. 

The averages in Equation (fl~2|) are in general difficult to calculate analytically, because the train- 
ing input locations x ente r in a highly nonlinear matter, cf. (fTOl); only for very specifi c situations can 
exact results be obtained ( Malzahn and Qpper . 2005 ; Rasmussen and Williamsl . 12005). Approximate 
learn ing curve pred ic tions have been derived, fo r Euclidean input spaces, with some degree of suc- 
cess dSollichl. [l999a.b; Qpper and Vivarellil. [l9 99; Will iams and Vivarelhl. l2000HMalzahn and Qpperl . 
l2003t ISollichl . 120021 ISollich and Haleesl . l2002t ISollich and Williams! . l2005h . We will show that in the 
case of GP regression of functions defined on graphs, learning curves can be predicted exactly in the 
limit of large random graphs. This prediction is broadly applicable because the degree distribution 
that specifies the graph ensemble is essentially arbitrary. 



It i s instructive to begin our an alysis by extending a previous approximation seen in ISollich 
(|1999al ); iMalzahn and Qpperl (|2005l ) to our discrete graph case. In so doing we will see explicitly 
how one may improve this approximation to fully e xploit the structure of ra ndom graphs, using 
belief propagation or equivalently the cavity method ( Mezard and ParisH 2003). We wil l sketc h the 
derivation of th e existing appr oximation following the method of IMalzahn and Qpperl (|2005l ): the 
result given by ISollich ( 1999a ) is included in this as a somewhat more restricted approximation. 
Both the approximate treatment and our cavity method take a statistical mechanics approach, so 
we begin by rewriting equation (1121) in terms of a generating or partition function Z 



e(N) 



(13) 



3. The case of mismatch has b een considered inlMalza hn and Qpperl i2005f) for fixed teacher functions, and for prior 
and noise level mismatch in ISollich! j2002t ): ISollich and Williamsl i2005f T 
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with 



Z = 



(14) 



In this representation the inputs a; only enter Z through the sum over /it. We introduce 7, to count 
the number of examples at vertex i so that Z becomes 



Z = J d/expf-i/ T C-V-~/ T diagg+A}/ 



(15) 



The average in equation (|13[) of the logarithm of this partition function can still not be carried 
out in closed form. The approximation given by lMalzahn and Opper ( 2005 ) and our present cavity 
approach diverge at this point. Section 13.41 discusses the existing approximation for the learning 
curve, applied to the case of regression on a graph. Section U then improves on this using the cavity 
method to fully exploit the graph structure. 



3.4 Kernel eigenvalue approximation 

The a pproach of Malzahn and Opper ( 20051 ) is to average \og(Z) from (fT5|) using the replica trick ( Mezard et al 
19871 ). One writes (logZ} x = lim n _yo „ l g(-^™)x, performing the average (Z n } x for integer n and 
assuming that a continuation to n — > is possible. The required n-th power of equation (|15|) is given 
by 



/fldr/exp (-~£(r) T c-v°-^£ 

a—l \ \ a i,a 



niftf 




(16) 



where the replica index a runs from 1 to n. Assuming as before that examples are generated 
independently and uniformly from V, the dataset average over the 7^ will, for large V, become 
equivalent to independent Poisson averages over 7^ with v ~ N/V. Explicitly performing these 
averages gives 



(Z n ) 



/ndrexpUi^rfc-r+^EO 

a— 1 \ a i 



■£„(/f) 2 /2<x 2 _ I 



X 



(17) 



In order to evaluate 
Malzahn and Qpperl 



one has to find a way to deal with the exponential term in the exponent. 
20051 ) do this using a variational approximation for the distribution of the f a , 



of Gauss ian form. Eve ntually this leads to the following eigenvalue learning curve approximation 
(see alsolSollich] (|l999al )): 



e(iV) 



N 



e(N) 



9(h) = £ (K 1 



(18) 



The eigenvalues X a of the kernel are defined here from the eigenvalue equatior0 (1/V) J2j Cij4>j — 
Atfii. The Gaussian variational approach is evidently justified for large tr 2 , where a Taylor expansion 
of the exponential term in (1171) can be truncated after the quadratic term. For small noise levels, on 
the other hand, the Gaussian variational approach will in general not capture all the details of the 
fluctuations in the numbers of examples 7$ . This issue is expected to be most prominent for values 
of v of order unity, where fluctuations in the number of examples are most relevant because some 



4. Here and below we consider the case of a uniform distribution of inputs across vertices, though the results can be 
generalised. 
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v = N/V 

Figure 3: (Left) Learning curves for GP regression with globally normalised kernels with p = 10, 
a — 2 on 3-regular random graphs for a range of noise levels a 2 . Dashed lines: eigenvalue 
predictions (see Section I3T4")) . solid lines with filled circles: numerically simulated learning 
curves for graphs of size V = 500, solid lines with triangles: cavity predictions (see Section 
14. ip . (Centre) As (left) for Poisson random graphs with mean degree 3. (Right) As (left) 
for power law generalised random graphs with exponent 2.5 and cutoff 2. 



vertices will not have seen examples locally or nearby and will have posterior variance close to the 
prior variance, whereas those vertices with examples will have small posterior variance, of order a 2 . 
This effect disappears again for large v, where the 0(y/U) fluctuations in the number of examples 
at each vertex become small in relative terms. Mathematically this can be seen from the term 
proportional to v in (fTT]) . which for large v ensures that only values of /f with exp(— J^aif?) 2 /^ 2 ) 
close to 1 will contribute. A quadratic approximation is then justified even if a 2 is not large. 

Learning curve predictions from equation (| 18[) using numerically computed eigenvalues for the 
globally normalised random walk kernel are shown in Figure [3] as dashed lines for random regular 
(left), Poisson (centre) and power law generalised random graphs (right). The predictions are com- 
pared to numerically simulated learning curves shown as solid lines with filled circles, for a range of 
noise levels. Consistent with the discussion above, the predictions of the eigenvalue approximation 
are accurate where the Gaussian variational approach is justified, i.e. for small and large v. Figure 
[3] also shows that the accuracy of the approximation improves as the noise level a 2 becomes larger, 
again as expected by the nature of the Gaussian approximation. 

3.4.1 Learning curves for large p 

Before moving on to the more accurate cavity prediction of the learning curves, we now look at 
how the learning curves for GP regression on graphs depend on the kernel lengthscale p/a. We 
focus for this discussion on random regular graphs, where the distinction between global and local 
normalisation is not important. In Section al! we saw that on a large regular graph the random walk 
kernel approaches a non-trivial limiting form for large p, as long as one stays below the threshold 
([7|) for p where cycles become important. One might be tempted to conclude from this that also 
the learning curves have a limiting form for large p. This is too naive, however, as one can see by 
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considering e.g. the effect of the first example on the Bayes error. If the example is at vertex i, the 
posterior variance at vertex j is, from (|10l) . Cjj — Cfj/(Cu + a 2 ). As the prior variances Cjj are all 
equal, to unity for our chosen normalisation, this is 1 — Cfj/(1 + a 2 ). The reduction in Bayes error is 



12 

of the location of the example vertex i, and in the notation of Section [2.11 can be written as 



therefore e(0) — e(l) = (1/V) Cf//(l + cr )■ As long as cycles are unimportant this is independent 



1 p 



<°)-^ = T-^J2^ c t P ( 19 ) 



where vi is as before the number of vertices a distance I away from vertex i, i.e. Vo = 1, vi = d(d— l) i_1 
for Z > 1. To evaluate (fT9"f for large p, one cannot directly plug in the limiting kernel form (JSJ) : the 
'shell volume' vi just balances the Z-dependence of the factor (d — l) - '/ 2 from Ci tP , so that one gets 
contributions from all distances Z, proportional to I 2 for large Z. Naively summing up to I = p would 
give an initial decrease of the Bayes error growing as p 3 . This is not correct; the reason is that 
while Ci yP approaches the large p- limit ([5]) for any fixed Z, it does so more and more slowly as Z 
increases. A more detailed analysis, sketched in Appendix IA. 21 shows that for large Z and p, Ci :P is 
proportional to the large p-limit l(d — 1)~'/ 2 up to a characteristic cutoff distance Z of order p 1 ^ 2 , 
and decays quickly beyond this. Summing in (|19[) the contributions of order I 2 up to this distance 
predicts finally that the initial error decay should scale, non-trivially, as p 3 ! 2 . 

We next show that this large p-scaling with p 3 ! 2 is also predicted, for the entire learning curve, 
by the eigenvalue approximation (fT8"|). As before we consider d-regular random graphs. The required 



spect rum of kernel eigenvalues A a becomes identical, for large V, to that on a d- regular tree (jMcKav . 
19811) . Explicitly, if A„ are the eigenvalues of the normalised graph Laplacian on the tree, then the 
kernel eigenvalues are X a — k _1 V a_1 (1 — A„/a) p . Here the factor V^ 1 comes from the same factor in 



the kernel eigenvalue definition after (|18[) , and k is the overall normalisation c onstant which enforces 
A r, = V~ Y J2i Cjj — !• The spectrum of the tree Laplacian is known (see McKay, 1981 : Chungl . 



199G): 



y/^-(A^l)2 

^(A^) = { 2Kd\H2-\ L ) — — + (20) 

otherwise 

where A± = 1 ± \{d - l) 1 / 2 . (There are also two isolated eigenvalues at and 2, which do not 
contribute for large V.) 

We can now write down the function g from (|18[) . converting the sum over kernel eigenvalues to 
V times an integral over Laplacian eigenvalues for large V. Dropping the L superscript, the result 
is 

g(h) = J + dAp(A)[/s(l - X/a)- p + hV' 1 ]- 1 (21) 

The dependence on hV~ l here shows that in the approximate learning curve (|18[) . the Bayes error 
will depend only on v — N/V as might have been expected. The condition for the normalisation 
factor k becomes simply g(0) = 1, or kT 1 — J dAp(A)(l — X/a) p . 

So far we have written down how one would evaluate the eigenvalue approximation to the learning 
curve on large d-regular random graphs, for arbitrary kernel parameters p and a. Now we want to 
consider the large p- limit. We show that there is then a master curve for the Bayes error against 
vp 3 l 2 . This is entirely consistent with the p 3 / 2 scaling found above for the initial error decay. 
The intuition for the large p analysis is that the factor (1 — \/a) p decays quickly as the Laplacian 
eigenvalue A increases beyond A_, so that only values of A near A_ contribute. One can then 
approximate 



12 



Gaussian processes on random graphs 



Similarly one can replace p(A) by its leading square root behaviour near A_, 

, (77 _ iU/4j5/2 

Substituting these approximations into (I21[) and introducing the rescaled integration variable y = 
p( A — A_ )/ (a — A_ ) gives 

/ — A \ 3 ^ 2 

g (h) =rK- 1 (l-\-/ay ( a - j F(/Me- 1 V- 1 (l-A_/a) 1 '), (24) 

where r = (d - l) 1/4 (i 5/2 /(7r(d - 2) 2 ) and F(z) = / °° dy j/ 1/2 (exp(y) + z)" 1 . Since g(0) = 1, the 
prefactor must equal 1/F(0) = 2/y / 7r. This fixes the normalisation constant k, and we can simplify 
to 

, , FthV^c- 1 ) , , /a-A_\ 3/2 

ffW = ^ ■ c = rF(Q) — — . (25) 



F(0) ' w V P 

The learning curves for large p are then predicted from (|18[) by solving 

e = J P(^c- 1 /(e + O)/^(0) (26) 

and depend clearly only on the combination vc~ x . Because c is proportional to p~ 3 / 2 , this shows 
that learning curves for different p should collapse onto a master curve when plotted against vp^l" 1 . 

A plot of the scaling of the eigenvalue learning curve approximations onto the master curve is 
shown in Figure [4] (left). As can be seen, large values of p are required in order to get a good collapse 
in the tail of the learning curve prediction, whereas in the initial part the p 3 ^ 2 scaling is accurate 
already for relatively small p. 

Finally, Figure 0] (right) shows that the predicted p 3 / 2 -scaling of the learning curves is present 
not only within the eigenvalue approximation, but also in the actual learning curves. Figure HI (right) 
displays numerically simulated learning curves for p = 5, 10, 15 and 20, against the rescaled number 
of examples vp^l 2 as before. Even for these comparatively small values of p one sees that the rescaled 
learning curves approach a master curve. 



4. Exact learning curves: Cavity method 

So far we have discussed the eigenvalue approximation to GP learning curves, and how it deviates 
from numerically exact simulated learning curves. As discussed in Section 13. 4| the deficiencies of 
the eigenvalue approximation can be traced back to the fact that the fluctuations in the number 
of training examples seen at each vertex of the graph cannot be accounted for in detail. If in 
the average over data sets these fluctuations could be treated exactly, one could hope to obtain 
exact or at least very accurate learning curve predictions. In this section we show that this is indeed 
possible in the case of a random walk kernel, for both global and local nor malisations. We derive ou r 
prediction using belief propagation or, equivalently, the cavity method (jMezard and Parisil 12003). 
The approach relies on the fact that the local structure of the graph on which we are learning is 
tree-like. This local tree-like structure always occurs in large random graphs sampled uniformly 
from an ensemble specified by an arbitrary degree distribution, which is the scenario we consider 
here. We will see that already for moderate graph sizes of V = 500, our predictions are nearly 
indistinguishable from numerical simulations. 

In order to apply the cavity method to the problem of predicting learning curves we must first 
rewrite the partition function (|15p in the form of a graphical model. This means that the weights 
being integrated over to obtain Z must consist of factors relating only to individual vertices, or to 
pairs of neighbouring vertices. The inverse of the covariance matrix in (|15|) creates factors linking 
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Figure 4: (Left) Eigenvalue approximation for learning curves on a random 3-regular graph, using 
a random walk kernel with a = 2, a 2 = 0.1 and increasing values of p as shown. Plotting 
against vp^l 2 shows that for large p these rescaled curves approach the master curve 
predicted from (126|) . though this approach is slower in the tail of the curves. (Right) As 
(left), but for numerically simulated learning curves on graphs of size V = 500. 



vertices at arbitrary distances along the graph, and so must be eliminated before the cavity method 
can be applied. We begin by assuming a general form for the normalisation of C that encompasses 
both local and global normalisation and set C = XT 1/2 [(1 - a" 1 )/ + a" 1 D- 1 / 2 AD^I^Kr 1 ! 2 
with ICij = KiSij. To eliminate interactions across the entire graph we first Fourier transform the 
prior term exp(— \f T C^ 1 f) in (fTS"]) . introducing Fourier variables h, and then integrate out the 
remaining terms with respect to / to give 

z <x ii(^ +x ) 1 J dh exp (~l hT ch - \ hJ diag ((3 + A ) _1 ) h ) ■ (27) 

The coupling between different vertices in Equation (|27|l is now through C so still links vertices up 
to distance p. To reduce these remaining interactions to ones among nearest neighbours only, one 
exploits the binomial expansion of the random walk kernel given in equation ([3]). Defining p addi- 
tional variables at each vertex as h q = K 1 / 2 (D~ 1 / 2 AD~ 1 / 2 ) q K~ 1 / 2 h, q = l,...,p, and abbreviating 
Cg = (P)(l- a - 1 ) p - q (a- 1 ) q , the interaction term h T Ch turns into a local term J2q=o c q {h°) T K' 1 h q . 
(Here we have, for the sake of uniformity, written h° instead of h.) Of course the interactions have 
only been 'hidden' in the h q , but the key point is that the definition of these additional variables can 
be enforced recursively, via h q — K 1 ^ 2 'D^ 1 / 2 AD~ 1 / 2 K~ 1 t 2 h q ~~ 1 . We represent this definition via a 
Dirac delta function (for each q — 1, . . . ,p) and then Fourier transform the latter, with conjugate 
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variables h q , to get 



z«nB+ A ) 1 / n d ^ 9 n d ^ ex p ( - ^ e ^t^-v 

i a q=0 q=l \ g=0 

- i(h°) T diag (Q + A) h° + (/»« - K^D-^AD-^K- 1 / 2 ^- 1 



(28) 



Because the graph adjacency matrix A now appears at most linearly in the exponent, all interactions 
are between nearest neighbours only. We have thus expressed our partition function Z as one of a 
(complex-valued) graphical model. 

4.1 Global normalisation 

We can now apply belief propagation to the calculation of marginals for the above graphical model. 
We focus first on the simpler case of a globally normalised kernel where Ki = k for all i. Rescaling 
each hf to d^J' 'k 1 / 2 ^ and hf to d^J 2 'h\ /k 1 ^ 2 we are left with 

z « n 6 + ^ - 1 / n d*« n n f 4 £ * - * 

t J q=0 q=l i \ 9=0 




(ij) V 9=1 



9-1 x 



(29) 



where the interaction terms coming from the adjacency matrix, A, have been written explicitly as 
a product over distinct graph edges 

To see how the Bayes error (|T2"|) can be obtained from this partition function, we differentiate 
log(Z) with respect to A as prescribed by (fT3|) to get 

, x , 1 1 ( ^k((/i°) 2 )\ . , 

^) = !ToF?|fTA( 1 -iM i J- (30) 

In order to calculate the Bayes error we therefore require specifically the marginal distributions of 
hf. These can be calculated using the cavity method: for a large random graph with arbitrary fixed 
degree sequence the graph is locally tree-like, so that if vertex i were eliminated the corresponding- 
subgraphs (locally trees) rooted at the neighbours j € A/"(i) of i would become approximately 
independent. The resulting cavity marginals created by removing i, which we denote Pj (hj, hj\x), 
can then be calculated iteratively within these subgraphs using the update equations 

lf(hj,hj\x) « exp Llj^c^h^ - l^L + ij^d^ 



q=Q '■>' q=l 



f ft dh k dh k exp ( -iE^i^r 1 + KhT 1 )) Pl 3) (h k ,h k \x). 

keJV(j)\i V q=l J 



(31) 



In terms of the sum-product formulation of belief propagation , the cavity m arginal on the left is the 
message that vertex j sends to the factor in Z for edge (jBishod . 120071) . 

One sees that the cavity update equations (|31l) are solved self-consistently by complex-valued 
Gaussian distributions with mean zero and covariance matrices V- . This Gaussian character of 
the solution was of course to be expected because in (|29p we have a Gaussian graphical model. 
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By performing the Gaussian integrals in the cavity update equations explicitly, one finds for the 
corresponding updates of the covariance matrices the rather simple form 



k£j*f(J)\i 

where we have defined the (2p + 1) x (2p + 1) matrices 



(32) 



Oj = di 
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At first glance 

o, v;! \xv s / x 

all the non-singular terms. We may then apply the Woodbury identity (|rlager 
matrix inverse in a form where the A — > limit can be taken without difficulties: 



(33) 



becomes singular for 7 = 0; however this is easily avoided. We introduce 
= Mj + [djK/(jj/a 2 + A)]e eQ with = (1,0, . ■ . , 0) so that Mj contains 



1989) to write the 



d-l 



Oj-y-xv^x 



fe=l 



= M7 1 



Mj'eoe^M 



-l 

3 



A)/(d jK ) + e^M- i e 



(34) 



In our derivation so far we have assumed a fixed graph, we therefore need to translate these 
equations to the setting we ultimately want to study, i.e. an ensemble of large random graphs. This 
ensemble is characterised by the distribution p(d) of the degrees di, with every graph that has the 
desired degree distribution being assigned equal probability. Instead of individual cavity covariance 
matrices Vj , one must then consider their probability distribution W(V) across all edges of the 
graph. Picking at random an edge of a graph, the probability that vertex j will have degree 
dj is then p(dj)dj / d, because such a vertex has dj 'chances' of being picked. (The normalisation 
factor is the average degree d — YliP{di)di-) Using again the locally treelike structure, the incoming 
(to vertex j) cavity covariances will be independent and identically distributed samples from 
W(V). Thus a fixed point of the cavity update equations corresponds to a fixed point of an update 
equation for W(V): 



W (V) = E ^ (/ II d ^ wmslv- (o-g xv k x 




(35) 



Since the vertex label is now arbitrary, we have omitted the index j. The average in (|35l) is over the 
distribution of the number of examples 7 = 7j at vertex j. As before we assume for simplicity that 
examples are drawn with uniform input probability across all vertices, so that the distribution of 7 
is simply 7 ~ Poisson(i/) in the limit of large N and V at fixed v = N/V . 

In general equation ([33]) - which can also be formally derived using the replica approach (see 
Urry and Sollichl . 2012) - cannot be solved an alytically, but we can tackle it numerically using 
population dynamics ( Mezard and Parisi . 200lh ■ This is an iterative technique where one creates a 
population of covariance matrices and for each iteration updates a random element of the population 
according to the delta function in (|35|) . The update is calculated by sampling from the degree 
distribution p{d) of local degrees, the Poisson distribution of the local number of examples v and 
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from the distribution W(Vi) of 'incoming' covariance matrices, the latter being approximated by 
uniform sampling from the current population. 

Once we have W(V), the Bayes error can be found from the graph ensemble version of equation 
(fT"3")) . This is obtained by inserting the explicit expression for ((/i°) 2 ) in terms of the cavity marginals 
of the neighbouring vertices, and replacing the average over vertices with an average over degrees d: 

eM = lT £W-^^ ■ (36) 

The number of examples at the vertex 7 is once more to be averaged over 7 ~ Poisson(^). The 
subscript '00' indicates the top left element of the matrix, which determines the variance of h°. 

To be able to use equation Q36p. we again need to rewrite into a form that remains explicitly 
non-singular when 7 = and A — > 0. We separate the 7-dependence of the matrix inverse again and 
write, in slightly modified notation as appropriate for the graph ensemble case, O — Ylk=i XVkX = 
Md+[dn/ (7/cr 2 + A)]eoeQ , where e$ = (1, 0, . . . , 0). The 00 element of th e matrix inverse appearing 



above can then be expressed using the Woodbury formula (|Hageii 119891 ) as 



V t^i J Wl a + X )/( dK ) + e o M d e o (37) 

1 

" ( 1 /a 2 + X) + d K (M^ 1 )oo' 
The A — ¥ limit can now be taken, with the result 

e{v) = / f[dV k W(V k ) 1 \ . (38) 

\ , J , , 7/ & + dn{M j 00/ 



fe=i 



7 



This has a simple interpretation: the cavity marginals of the neighbours provide an effective Gaussian 
prior for each vertex, whose inverse variance is d(iW~ 1 )oo- 

The self-consistency equation (|35|) for W( V) and the expression (|38|) for the resulting Bayes error 
allow us to predict learning curves as a function of the number of examples per vertex, v, for arbitrary 
degree distributions p{d) of our random graph ensemble. For large graphs the predictions should 
become exact. It is worth stressing that such exact learning curve predictions have previously only 
been available in very specific, noise-free, GP regression scenarios, while our result for GP regression 
on graphs is applicable to a broad range of random graph ensembles, with arbitrary noise levels and 
kernel parameters. 

We note briefly that for graphs with isolated vertices (d = 0), one has to be slightly careful: al- 
ready in the definition of the covariance function @ one should replace D — s- D + 5I to avoid division 
by zero, taking 5 — > at the end. For d = one then finds in the expression (f3"8"|) that (M~ 1 )qq = 
l/(co<5), where Co is defined before (|28|) . As a consequence, k(S + d)(M~ 1 ) 00 — k8(M~ 1 ) 00 = k/cq- 
This is to be expected since isolated vertices each have a separate Gaussian prior with variance co/fc. 

Equations (|35|) and ([38} still require the normalisation constant, k. The simplest way to calculate 
this is to run the population dynamics once for k = 1 and v = 0, i.e. an unnormalised kernel and 
no training data. The result for e then just gives the average (over vertices) prior variance. With 
k set to this value, one can then run the population dynamics for any v to obtain the Bayes error 
prediction for GP regression with a globally normalised kernel. 

Comparisons between the cavity prediction for the learning curves, numerically exact simulated 
learning curves and the results of the eigenvalue approximation are shown in Figure [3] (left, centre 
and right), for regular, Poisson and generalised random graphs with power law degree distributions 
respectively. As can be seen the cavity predictions greatly outperform the eigenvalue approximation 
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and are accurate along the whole length of the curve. This confirms our expectation that the 
cavity approach will become exact on large graphs, although it is remarkable that the agreement is 
quantitatively so good already for graphs with only five hundred vertices. 

4.2 Predicting prior variances 

As a by-product of the cavity analysis for globally normalised kernels we note that in the cavity 
form of the Bayes error in equation ([38]) . the fraction (7/<7 2 + <iK;(iW"^ 1 )oo) -1 is the local Bayes error, 
i.e. the local posterior variance. By keeping track of individual samples for this quantity from the 
population dynamics approach, we can thus predict the distribution of local posterior variances. If 
we set v — 0, then this becomes the distribution of prior variances. The cavity approach therefore 
gives us, without additional effort, a prediction for this distribution. 

We can now go back to Section [3.2l and compare the cavity predictions to numerically simulated 
distributions of prior variances. The cavity predictions for these distributions are shown by the black 
lines in Figure [21 The cavity approach provides, in particular, detailed information about the tail of 
the distributions as shown in the insets. There is good agreement between the predictions and the 
numerical simulations, both regarding the general shape of the variance distributions and the fine 
structure with a number of non-trivial peaks and troughs. The residual small shifts between the 
predictions and the numerical results for a single instance of a large graph are most likely due to 
finite size effects: in a finite graph, the assumption of a tree-like local structure is not exact because 
there can be rare short cycles; also, the long cycles that the cavity method ignores because their 
length diverges logarithmically with V will have an effect when V is finite. 

4.3 Local normalisation 

We now extend the cavity analysis for the learning curves to the case of locally normalised random 
walk kernels, which, as argued above, provide more plausible probabilistic models. In this case the 
diagonal entries of the normalisation matrix 7C are defined as 

* = / dff?P(f), (39) 

where P(f) is the GP prior with the unnormalised kernel C . This makes clear why the locally nor- 
malised kernel case is more challenging technically: we cannot calculate the normalisation constants 
once and for all for a given random graph ensemble and set of kernel parameters p and a as we did 
for k in the globally normalised scenario. Instead we have to account for the dependence of the Ki 
on the specific graph instance. 

On a single graph instance, this stumbling block can be overcome as follows. One iterates the 
cavity updates (|32[) for the unnormalised kernel and without the training data (i.e. setting n — 1 and 
ji = 0). The local Bayes error at vertex i, given by the i-th term in the sum from (|30|) . then gives 
us Ki. Because ji — 0, one has to use the Woodbury trick as before to get well-behaved expressions 
in the limit where the auxiliary parameter A — > 0, as explained after (13T>1) . 

Once the Ki have been determined in this way, one can use them for predicting the Bayes error 
for the scenario we really want to study, i.e. using a locally normalised kernel and incorporating 
the training data. The relevant partition function is the analogue of ([28| for local normalisation. 
Dropping the prefactors, the resulting Z can be written as 

f nd^nd^nexp f-^E^^-^^^+iE^ 

J q=0 q=l i \ q=0 7l ' q=l 

x IT cx p f-iE^r+^r 1 ) 
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where we have rescaled hf to d\ h\ and hi to d 1 J 2 K i X ^ 2 h\. Given that the Ki have already 
been determined, this is a graphical model for which marginals can be calculated by iterating to a 
fixed point the equations for the cavity marginals: 



f J] dh k dh k exp l-i^K' 1 + K^T 1 )) P&k(hk,h k \x) 

keAf(j)\i \ q=l / 



(41) 



As in Section 14.11 these update equations are solved by cavity marginals of complex Gaussian form, 
and so we can simplify them to updates for the covariance matrices: 

= ( Otoci - E X ^Lx) ■ (42) 

Here X is defined as in equation (|33[) and Oioc,j is the obvious analogue of Oj defined in equation 
(|33[> : specifically, k is replaced by Kj. Once the update equations have converged, one can calculate 
the Bayes error from a similarly adapted version of (f30|) . 

The above procedure for a single fixed graph now has to be extended to the case of an ensemble 
of large random graphs characterised by some degree distribution p(d). The outcome of the first 
round of cavity updates, for the unnormalised kernel without training data, is then represented by a 
distribution of cavity covariances V , while the second one gives a distribution of cavity covariances 
VJoc for the locally normalised kernel, with training data included. Importantly, these message 
distributions are coupled to each other via the graph structure, so we need to look at the joint 
distribution W{V\ OCl V). 



Detailed analysis using the replica method ([Urrv and Sollichl . 120121) shows that the correct fixed 



point equation updates the V-messages as in the globally normalised case with 7 = 0. The second 
set of local covariances, V\ OC: are then updated according to (|42"T) . with a normaliser calculated using 
the marginals from the d — 1 V-covariances and an additional 'counter flow' covariance generated 
from W^V) = j dV\ oc W(V, V\ oc ), subject to the constraint that the neighbours local marginals are 
consistent. We find in practice that the consistency constraint can be dropped and the fixed point 
equation for the distribution of the two sets of messages can be approximated by 

W(V loc , V) = /j2 / n dV fc dMoc,fedV d J] W(V loCik , V k )W(V d ) 

\ d fc=i fc=i 

x s |moc - (o, OC;j J2 xv k?o C xJ j * (v - (o - E xv k x 

One sees that if one marginalises over V\ oc , then one obtains exactly the same condition on H^(V) 
as before in the globally normalised kernel case (but with k = 1 and v = 0), see (|35|) . This reflects 
the fact that the cavity updates for the first set of messages on a single graph do not rely on any 
information about the second set. The first delta function in (|4"5|) corresponds to the fixed point 
condition for this second set of cavity updates. This condition depends, via the value of the local K, 
on the V-cavity covariances: 

k = (44) 

It may seem unusual that d copies of V enter here; Vd represents the cavity covariance from the 
first set that is received from the vertex to which the new message V[ oc is being sent. While this 
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'counterflow' appears to run against the basic construction of the cavity or belief propagation method, 
it makes sense here because the first set of cavity messages (or equivalently the distribution W(V)) 
reaches a fixed point that is independent of the second set, so the counterflow of information is only 
apparent. The reason why knowledge about Vd is needed in the update is that n is the variance of 
a full marginal rather than a cavity marginal. 

Similarly to the case of global normalisation, (|43[) can be solved by looking for a fixed point of 
W(V[ oc , V) using population dynamics. Updates are made by first updating V using equation (f3Tj) 
and then updating V| oc using (j4Tj) with k = Ki replaced by (|44| . 

Once a fixed point has been calculated for the covariance distribution we apply the Woodbury 
formula to (|30[) in a similar manner to Section 14.11 to give the prediction for the learning curve for 
GP regression with a locally normalised kernel. The result for the Bayes error becomes 

«= Q>W) / f[dV loc , k dV k f[W(V loc , k ,V k ) ' ! ) ■ (45) 

\ d J k=i k=i ^1° +y M djocho/(M d )oo/ 7 

Learning curve predictions for GPs with locally normalised kernels as they result from the cavity 
approach described above are shown in Figure [5] The figure shows numerically simulated learning 
curves and the cavity prediction, both for Poisson random graphs (left) and power law generalised 
random graphs (centre) of size V — 500. As for the globally normalised case one sees that the cavity 
predictions appear quantitatively very accurate even with the simplified update equation (|43|) . They 
capture all aspects of learning curve both qualitatively and quantitatively, including e.g. the shoulder 
in the curves from disconnected single vertices, a feature discussed in more detail below. 

5. A qualitative comparison of learning with locally and globally 
normalised kernels 

The cavity approach we have developed gives very accurate predictions for learning curves for GP 
regression on graphs using random walk kernels. This is true both for global and for local normal- 
ization of the kernel. We argued in Section 13.21 that the local normalization is much more plausible 
as a probabilistic model, because it avoids variability in the local prior variances that is non-trivially 
related to the local graph structure and so difficult to justify from prior knowledge. We now compare 
what the qualitative effects of the two different normalizations are on GP learning. 

It is not a simple matter to say which kernel is 'better', the locally or globally normalised one. 
Since we have dealt with the matched case, where for each kernel the target functions are sampled 
from a GP prior with that kernel as covariance function, it would not make sense to say the better 
kernel is the one that gives the lower Bayes error for given number of examples, as the Bayes error 
reflects both the complexity of the target function and the success in learning it. A more definite 
answer could be obtained only empirically, by running GP regression with local and global kernel 
normalization on the same datasets and comparing the prediction errors and also the marginal 
data likelihood. The same approach could also be tried with synthetic datasets generated from GP 
priors that are mismatched to both priors we have considered, defined by the globally and locally 
normalised kernel, though justifying what is a reasonable choice for the prior of the target function 
would not be easy. 

While a detailed study along the lines above is outside the scope of this paper, we can nevertheless 
at least qualitatively study the effect of the kernel normalisation, to understand to what extent the 
corresponding priors define significantly different probabilistic models. Figure [5] (right top and 
bottom) overlays the learning curves for global and local kernel normalisations, for a Poisson and 
a power law generalised random graph respectively. There are qualitative differences in the shapes 
of the learning curves, with ones for the locally normalised kernel exhibiting a shoulder around 
v = 2. This shoulder is due to the proper normalization of isolated vertices to unit prior variance; 
by contrast, as shown earlier in Figure [5] (left), global normalization gives too small a prior variance 
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v = N/V 

Figure 5: (Left) Learning curves for GP regression with locally normalised kernels with p = 10, a = 2 
on Poisson random graphs with mean degree 3, for a range of noise levels a 2 . Solid lines 
with filled circles: numerically simulated learning curves for graphs of size V — 500, solid 
lines with triangles: cavity predictions (see Section |4~TT) . (Left inset) Dashed lines show 
single vertex contributions to the learning curve (solid line with filled circles). (Centre) As 
(left) for power law generalised random graphs with exponent 2.5 and cutoff 2. (Right top) 
Comparison between learning curves for locally (squares) and globally (circles) normalised 
kernels for Poisson random graphs. (Right bottom) As (right top) for power law random 
graphs. 
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to such vertices. The inset in Figure [5] (left) shows the expected learning curve contributions from 
all locally normalised isolated vertices (single vertex subgraphs) as dotted lines. After the GP learns 
the rest of the graph to a sufficient accuracy, the single vertex error dominates the learning curve 
until these vertices have typically seen at least one example. Once this point has been passed, the 
dominant error comes once more from the giant connected component of the graph, and the GP 
learns in a similar manner to the globally normalised case. A similar effect, although not plotted, is 
seen for the generalised random graph case. 

We can extend the scope of this qualitative comparison by examining how a student GP with 
a kernel with one normalisation performs when learning from a teacher with a kernel with the 
other normalisation. This is a case of model mismatch; our theory so far does not extend to this 
scenario, but we can obtain learning curves by numerical simulation. Figure |6] (left) shows the case 
of GP students with a globally normalised kernel for a teacher with a locally normalised kernel. 
The learning curves for the mismatched scenario are very different from those for the matched case 
(Figures [5] and [3]), showing an increase in error as v approaches unity. The resulting maximum in the 
learning curve again emphasises that the two choices of normalization produce distinctly different 
probabilistic models. Similar behaviour can be observed for the case of power law generalised random 
graphs, shown in Figure |6] (centre). In both cases close inspection (see Appendix [Bj shows that the 
error maximum is caused by 'dangling edges' of the graph, i.e. chains of vertices (with degree two) 
extending away from the giant graph component and terminating in a vertex of degree one. 

As a final qualitative comparison between globally and locally normalised kernels, Figure |H] 
(right) shows the variance of local posterior variances. This measures how much the local Bayes 
error typically varies from vertex-to-vertex, as a function of the data set size. Plausibly one would 
expect that this error variance is low initially when prediction on all vertices is equally uncertain. For 
large datasets the same should be true because errors on all vertices are then low. In an intermediate 
regime the error variance should become larger because examples will have been received on or near 
some vertices but not others. As Figure [6] (right) shows, for kernels with local normalization we find 
exactly this scenario, both for Poisson and power law random graphs. The error variance is low for 
small v — N/V, increasing to a peak at v ps 0.2 and finally decreasing again. 

These results can now be contrasted with those for globally normalised kernels, also displayed 
in Figure [5] (right). Here the error variance is largest at v = and decays from there. This 
means that the initial variance in the local prior variances is so large that any effects from the 
uneven distribution of example locations in any given data set remains subdominant throughout. 
We regard this as another indication of the probabilistically implausible character of the large spread 
of prior variances caused by global normalization. 

6. Conclusions and further work 

In this paper we studied random walk kernels and their application to GP regression. We began, in 
section [21 by studying the random walk kernel, with a focus on applying this to c£-regular trees and 
graphs. We showed that the kernel exhibits a rather complicated approach to the fully correlated 
limit; this limit is reached only beyond a graph-size dependent threshold for the kernel range p/a, 
where loops become important. If p/a is large but below this threshold, the kernel reaches a non- 
trivial limiting shape. 

In Section[3]we moved on to the application of random walk kernels to GP regression. We showed 
(Section 13. 2\i that the more typical approach to normalisation, i.e. scaling the kernel globally to a 
desired average prior variance, results in a large spread of local prior variances that is related in a 
complicated manner to the graph structure; this is undesirable in a prior. We suggested as a simple 
remedy to perform local normalisation, where the raw kernel is normalised by its local prior variance 
so that the prior variance becomes the same at every vertex. 

In order to get a deeper understanding of the performance of GPs with random walk kernels 
we then studied the learning curves, i.e. the mean Bayes error as a function of the number of ex- 
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Figure 6: (Left) Numerically simulated learning curves for a GP with a globally normalised kernel 
with p — 10 and a = 2, on Poisson random graphs with mean degree 3 for a range of noise 
levels. The teacher GP has a locally normalised kernel with the same parameters. (Centre) 
As (left) but for power law generalised random graphs with exponent 2.5 and cutoff 2. 
(Right top) Error variance for GP regression with locally (stars) and globally (circles) 
normalised kernels against v, on Poisson random graphs, and for matched learning with 



p = 10, a 
graphs. 



2, a = 0.1. (Right bottom) As (top) but for power law generalised random 
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amples. We began i n sec tion 13.41 by applying a previous approximation due to ISollichl (|1999alk 
Malzahn and Qpperl ( 2005 ) to the case of discrete inputs that are vertices on a graph. We demon- 



strated numerically that this approximation is accurate only for small and large number of training 
examples per vertex, v — N/V, while it fails in the crossover between these two regimes. The outline 
derivation of this approximation suggested how one might improve it: one has to exploit fully the 
structure of random graphs, using the cavity method, thus avoiding approximating the average over 
data sets. In Section |4] we implemented this programme, beginning in Section 14. II with the case of 
global normalisation. We showed that by Fourier transforming the prior and introducing 2p addi- 
tional variables at each vertex we can rewrite the partition function in terms of a complex- valued 
Gaussian graphical model, where the marginals that are required to calculate the Bayes error can 
be found using the cavity method, or equivalently belief propagation. In Section 14.31 we tackled the 
more difficult scenario of a locally normalised kernel. This required two sets of cavity equations. 
The first serves to calculate the local normalisation factors. The second one then combines these 
with the information about the data set to find the local marginal distributions. 

Finally in Section [5] we qualitatively compared GPs with kernels that are normalised globally 
and locally. We showed that learning curves are indeed qualitatively different. In particular, local 
normalisation leads to a shoulder in the learning curves owing to the correct normalisation of the 
single vertex disconnected graph components. We also considered numerically calculated mismatch 
curves where the teacher was set to be a GP with a locally normalised kernel and the student was a 
GP with a globally normalised kernel. This causes a maximum in the learning curves, caused by the 
large differences in the teacher and student priors. Lastly we looked at the variance among the local 
Bayes errors, for GPs with both globally and locally normalised kernels. Plausibly, locally normalised 
kernels lead to this error variance being maximal for intermediate data set sizes. This reflects the 
variation in number of examples seen at or near individual vertices. For globally normalised kernels, 
the error variance inherited from the spread of local prior variances is always dominant, obscuring 
any signatures from the changing 'coverage' of the graph by examples. 

In further work we intend to extend the cavity approximation of the learning curves to the case 
of mismatch, where teacher and student have different kernel hyperparameters. It would also be in- 
teresting to apply the cavity approximation of the learn ing curves of GPs w i th random walk kernels 
to mo re general random graphs, like those considered in Rogers et al.l ( 201 0h: lKuhn and van Mourild 
(|201l[ ). This would enable us to consider graphs exhibiting some community structure. Looking fur- 
ther ahead, preliminary work has shown that it should be possible to extend the cavity learning curve 
approximation to the problem of graph mismatch, where the student has incomplete information on 
the graph structure of the teacher. 
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Appendix A. Random walk kernel on d-regular trees 

We detail here how to derive the limiting kernel from ([5]) in Section 12.11 and how to calculate the 
scaling of the learning curve stated in Section 13.4.11 



A.l Large-p limit from heat kernel results 

In I Chung and Yau (Il999h the authors considered heat kernels on graphs. The heat kernel for a 
graph with normalised Laplacian L (see Section [2] in the main text) is given by 



H = exp(-tL) , t > 0, 



(46) 



\a 2 . 



Chung and Yau (1999) gave results 



This is exactly the diffusion kernel given in (fl} with t 
for heat kernels on d-regular trees. Their method involves treating the tree as a covering of a path, 
which is closely related to the mapping onto a one-dimensional lattice discussed in Section 12.11 The 



eigenvectors and eigenvalues of L are then found within this mapping. If we call the unnormalized 
form of the random walk kernel s we considered i n the main text again C = (/ — a _1 Z/) p , then 
we can directly use the results of IChung and Yaul (|l999h . simply by modifying the function that is 
applied to each eigenvalue A of L from exp(— tX) to (1 — X/a) p . This gives, for the unnormalised 
random walk kernel element between two vertices on the tree at distance I, 



~ 2(d - l)d r , [1 - (1 - 2cos(x)Vd^T/d)/a\ p sm 2 (x) 
G ,p = I ax 



d 2 -4(d- l)cos 2 (x) 



C 



i>i,p 



7r(d- 1)V2- 



ax [1 - (1 - 2 cos(ie) Vd^T/d) /a] p 

sm(x)[(d - 1) sin((Z + l)x) - sin((l - l)x)} 
X d 2 -A{d- l)cos 2 (x) ' 



(47) 
(48) 



In the limit p — > oo, the factor in square brackets in both expressions means that only values of x 
up to 0(p^ 1 ^ 2 ) contribute significantly to the integral. One can then expand sin(a;) ~ x linearly 
everywhere, and similarly cos(x) w 1. Calculating Ci jP in this way for p — > oo gives the result (J5j> 

for Cj,p = C;,p/Co, P - 



A. 2 Scaling form for large p 

In Section 13.4.11 we derived the learning curve decay due to the first example, given in (|19[) . As 
explained there, to understand how this behaves for large p one needs to know how G\ p approaches 
its limiting form (|5|). In the outline derivation in the previous subsection we saw that in order to 
obtain this limiting form one expands not just sin(x), but also sin((/ — l)x) and sin((Z + l)x) linearly 
because x = 0(p~ 1 ^ 2 ) is small. For the two sine factors this will break down once /p -1 / 2 is of order 
one, i.e. for large enough I. This suggests looking at G\ p as function of I' = Ip^ 1 / 2 . To focus on the 
relevant part of the integral one can similarly transform the integration variable to x' — xp 1 ^ 2 . For 
p — > oo at constant I' , the integral for G\p in (l48l) then becomes proportional to 

{d-l)- 1 ' 2 dx' 'e- x ' 2 ^ T ^ da ^x'{d-2)sm{l'x') (49) 
Jo 

Integration by parts then gives Ci tP oc C^ p oc (d— \)~ l l 2 V exp(— 2 Sd-i )' Comparing with the large-/ 
limit of ([5]), Q p oc (d — X)~ l l 2 l\ shows that the effect of finite p is contained in the exponential 
(Gaussian) cutoff factor which becomes significant for /' of order unity, i.e. / = 0(p 1 ^ 2 ), as stated in 
the main text. 

Visually, the cutoff effects for large I and p can be seen more clearly by plotting not Ci. p directly 
but rather Ri p the unormalised version of Ri. p as defined in Section f3. 4.11 As vi — d(d — 1) /_1 for 



25 



Urry and Sollich 




Figure 7: Plot demonstrating the p — > oo scaling from of ppi lP , the normalised version of i?/. 
Dashed line shows the analytically found functional from of the scaling. 



I > 1, the additional y/vi factor just removes the decay with (d — 1) l l 2 from Ci lP . From the above 
discussion, we then expect to find for large / that Ri jP oc I exp(— 2 ^f_ 1 )■ A plot of numerical results 

for a suitably normalised version of Ri iP (see below), plotted against Ip^ 1 ! 2 for increasing p, is shown 
in Figure [7J There is a clear trend towards the predicted asymptotic behaviour for large p (dashed 
line). 

It may be somewhat surprising that the cutoff lengthscale that appears in the analysis above is 
I = 0(p 1 ^ 2 ), which for large p is much smaller than the typical kernel range p/a. To understand this 
intuitively, one can go back to (|6]) for Ri >v and use that this is essentially unnormalised diffusion. 
Taking Ri p as the unormalised version of Ri p again letting pi p — Ri. p j~ p with 7 = (1 — 1/a) + 
2\[d — 1/ (ad) to re-establish normalisation, one finds that pi iP evolves almost according to a diffusion 
process in 'time' p, except that probability conservation is broken at I = and 2 = 1. In the (leaky) 
diffusion process pi iP the typical lengthscale I should then scale with time as p 1 ^ 2 , exactly as we found 
above. This diffusion interp retation could also b e used to reproduce the quantitative scaling form, 
by adapting the methods of iMonthus and Texier (1996) where the authors study a one dimensional 
walk with a reflective boundary to compute the large-p scaling. The normalised version of R^ p 
shown in Figure [7] is in fact ppi, p . 



Appendix B. Analysis of the mismatch learning curves 

In this appendix we suggest a way of understanding the mismatch maximum seen in the learning 
curves plotted in Figure |H] (left) and (centre). We begin by considering the mean of the GP posterior, 
i.e. the prediction function. If we arrange the predictive means at each vertex into a vector /, then 
from © this can be written as 

f = K v (K + a 2 I)- 1 y (50) 

where (Ky)j V = Cj_ Xv for v = 1,...,N and j = 1, . . . , V and = C Xfi . Xv for (i, v = 1,...,N. 
(Recall that is the location of the /z-th training input, which on a graph is just a label in the 
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range {1, . . . , V}.) We can rewrite this in the form / = Mz where 

M = K v {K + a 2 I)- 1 ' 2 
z = (K + a 2 I)- 1 / 2 y. 



(51) 
(52) 



One sees that M represents teacher-independent aspects of /. The vector of 'pseudo-training out- 
puts' z has been defined so that in the matched case, it obeys (zz T ) = I. We think of z as 
pseudo-training outputs because if its components are sampled as independent and identically 
distributed unit variance Gaussian random variables, then / = Mz will have the correct distri- 



average, a sum of squares of the corresponding entries in the effective prediction vectors. 

For the case of mismatch, e.g. because student and teacher use differently normalised kernels, 
the only change is in the statistics of the z. Their covariance matrix (zz T ) will no longer simply 
be the identity matrix, and so act as a re-weighting of the prediction vectors. To understand our 
numerical simulations, we considered a value of v around the maximum in the mismatched learning 
curve, listed the largest (diagonal) entries of zz T and plotted their corresponding prediction vectors. 
These prediction vectors were generally localised around dangling ends of the graph, substantiating 
our claim in the main text that it is from these graph regions that the major contribution to the 
learning curve maximum arises. 




27 



Urry and Sollich 



References 

S. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Comput., 4(4):605-618, 
1992. 

C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA., 2nd edition, 
2007. 

T. Britton, M. Deijfen, and A. Martin-Loeff. Generating simple random graphs with prescribed 
degree distribution. Journal of Statistical Physics, 124(6):1377-1397, 2006. 

F. K. Chung. Spectral graph theory, volume 92 of Regional conference series in mathematics. Amer- 
ican Mathematical Society, Dec 1996. 

F. K. Chung and S. T. Yau. Coverings, heat kernels and spanning trees. Electronic Journal of 
Combinatorics, 6:R12, 1999. 

N. Cristianini and J. Shawe- Taylor. An Introduction to Support Vector Machines and Other Kernel- 
Based Learning Methods. Cambridge University Press, 2000. 

P. Erdos and A. Renyi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290-297, 
1959. 

J. Freeman and D. Saad. Dynamics of on-line learning in radial basis networks. Phys. Rev. E, 56 
(1):907-918, Jul 1997. 

M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine 
Learning Research, 2(2):299-312, 2002. 

W.W. Hager. Updating the inverse of a matrix. SIAM Rev., 31(2):221-239, 1989. 

M. S. Handcock and M. L. Stein. A Bayesian analysis of Kriging. Technometrics, 35(4):403-410, 
1993. 

D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical 
mechanics. Mach. Learn., 25(2-3):195-236, 1996. 

J. P. C. Klcijncn. Kriging metamodeling in simulation: A review. European Journal of Operational 
Research, 192(3):707 - 716, 2009. 

R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In C. Sam- 
mut and Hoffmann A. G., editors, Proceedings of the 19th International Conference on Machine 
Learning (ICML), pages 315-322, San Francisco, CA, USA., 2002. Morgan Kaufmann. 

R. Kiihn and J. van Mourik. Spectra of modular and small-world matrices. J. Phys. A, 44(16): 
165205, 2011. 

D. Malzahn and M. Opper. Learning curves and bootstrap estimates for inference with Gaussian 
processes: A statistical mechanics study. Complexity, 8(4):57-63, 2003. 

D. Malzahn and M. Opper. A statistical physics approach for the analysis of machine learning 
algorithms on real data. J. Stat. Mech. Theor. Exp., 2005(11):P11001, 2005. 

B. D. McKay. The expected eigenvalue distribution of a large regular graph. Linear Algebra and its 
Applications, 40:203-216, 1981. 

J. R. Mcinhold and N. D. Singpurwalla. Understanding the Kalman filter. The American Statistician, 
37(2):123-127, 1983. 



28 



Gaussian processes on random graphs 



M. Mezard and G. Parisi. The Bethe lattice spin glass revisited. Eur. Phys. J. B, 20(2):217-233, 
2001. 

M. Mezard and G. Parisi. The cavity method at zero temperature. Journal of Statistical Physics, 
lll(l-2):l-34, 2003. 

M. Mezard, G. Parisi, and M. Virasoro. Spin Glass Theory and Beyond, volume 9 of World Scientific 
lecture notes in physics. World Scientific, Singapore, 1987. 

C. Monthus and C. Texier. Random walk on the Bethe lattice and hyperbolic Brownian motion. 
Journal of Physics A: Mathematical and General, 29:2399-2409, 1996. 

K. R. Miiller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based 
learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, 2001. 

R. M. Neal. Bayesian learning for neural networks, volume 118 of Lecture notes in statistics. 
Springer- Verlag, Berlin, Germany, 1996. 

M. Opper and D. Haussler. Bounds for predictive errors in the statistical mechanics of supervised 
learning. Phys. Rev. Lett, 75(20):3772-3775, 1995. 

M. Opper and D. Malzahn. A variational approach to learning curves. In T. G. Dietterich, S. Becker, 
and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, 
pages 463-469, Cambridge, MA, USA., 2002. The MIT Press. 

M. Opper and F. Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. 
In M. S. Kearns, S. A. Solla, and D A Cohn, editors, Advances in Neural Information Processing 
Systems, volume 11, pages 302-308, Cambridge, MA, USA., 1999. The MIT Press. 

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Nov 
2005. 

T. Rogers, C. Vicente, K. Takeda, and I. Castillo. Spectral density of random graphs with topological 
constraints. J. Phys. A, 43(19):195002, 2010. 

H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys 
Rev A, 45(8):6056-6091, 1992. 

A. Smola and R. Kondor. Kernels and regularization on graphs. In B Scholkopf and M. K. Warmuth, 
editors, Led. Notes Artif. Int., volume 2777, pages 144-158, 2003. 

P. Sollich. Learning curves for Gaussian processes. In M. S. Kearns, S. A. Solla, and D A Cohn, edi- 
tors, Advances in Neural Information Processing Systems, volume 11, pages 344-350, Cambridge, 
MA, USA., 1999a. The MIT Press. 

P. Sollich. Approximate learning curves for Gaussian processes. In Ninth international conference 
on artificial neural networks (ICANN99), pages 437-442, Edison, NJ, USA., 1999b. Institute of 
Electrical Engineers Inspec inc. 

P. Sollich. Gaussian process regression with mismatched models. In S Becker T G Dietterich and 
Z Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 
519-526, Cambridge, MA, USA., 2002. The MIT Press. 

P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and 
bounds. Neural Comput., 14(6):1393-1428, 2002. 



29 



Urry and Sollich 



P. Sollich and C. K. I. Williams. Using the equivalent kernel to understand Gaussian process 
regression. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information 
Processing Systems, volume 17, pages 1313-1320, Cambridge, MA, USA., 2005. The MIT Press. 

P. Sollich, M. J. Urry, and C. Coti. Kernels and learning curves for Gaussian process regression on 
random graphs. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 
editors, Advances in Neural Information Processing Systems, volume 22, pages 1723-1731, Red 
Hook, NY, USA., 2009. Curran Associates Inc. 

M. J. Urry and P. Sollich. Exact learning curves for Gaussian process regression on large random 
graphs. In J. Lafferty, C. K. I. Williams, J. Shawe- Taylor, R.S. Zemel, and A. Culotta, editors, 
Advances in Neural Information Processing Systems, volume 23, pages 2316-2324, Red Hook, NY, 
USA., 2010. Curran Associates Inc. 

M. J. Urry and P. Sollich. Replica theory for learning curves for Gaussian processes on random 
graphs. To be published in Journal of Physics A, 2012. 

T. L. H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Rev. Mod. 
Phys., 65(2):499-556, 1993. 

C. Williams and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. 
Mach. Learn., 40:77-102, 2000. 



30 



